Identifying key soil characteristics for Francisella tularensis classification with optimized machine learning models

Francisella tularensis (Ft) poses a significant threat to both animal and human populations, given its potential as a bioweapon. Current research on the classification of this pathogen and its relationship with soil physical–chemical characteristics often relies on traditional statistical methods. In this study, we leverage advanced machine learning models to enhance the prediction of epidemiological models for soil-based microbes. Our model employs a two-stage feature ranking process to identify crucial soil attributes and hyperparameter optimization for accurate pathogen classification using a unique soil attribute dataset. Optimization involves various classification algorithms, including Support Vector Machines (SVM), Ensemble Models (EM), and Neural Networks (NN), utilizing Bayesian and random search techniques. Results indicate the significance of soil features such as clay, nitrogen, soluble salts, silt, organic matter, and zinc, while identifying the least significant ones as potassium, calcium, copper, sodium, iron, and phosphorus. Bayesian optimization yields the best results, achieving an accuracy of 86.5% for SVM, 81.8% for EM, and 83.8% for NN. Notably, SVM emerges as the top-performing classifier, with an accuracy of 86.5% for both Bayesian and random search optimizations. The insights gained from employing machine learning techniques enhance our understanding of the environmental factors influencing Ft's persistence in soil. This, in turn, reduces the risk of false classifications, contributing to better pandemic control and mitigating socio-economic impacts on communities.

1. Introduction and dataset: we introduce a unique soil feature dataset for Ft-positive and Ft-negative sites, consisting of 21 soil characteristics.

Material and methods
The research employs a systematic approach, starting with the ranking of soil features through various techniques such as the SVM attribute evaluator, ReliefF, Chi-Square, and Gini-Index algorithms. Following feature ranking, a nested classification methodology is implemented. This involves iteratively selecting the top-ranked features and applying them to optimize classifiers through hyperparameter optimization techniques. The nested classification approach allows for a stepwise refinement of the model, ensuring that the classifiers are tailored to the most relevant features. This sequential strategy, illustrated in Fig. 1, aims to enhance the robustness and predictive accuracy of the classification model.

Sample acquisition and analysis
The study was conducted in Punjab province, recognized for its predominant agricultural setting and substantial human and livestock populations. Employing a three-stage sampling design, we selected districts representing key livestock production areas with heightened annual disease incidence. Locations across the province, including livestock barns and agricultural land, identified as Ft positive, underwent soil chemistry analysis. An equivalent number of locations where the Ft genome was not detected were also selected to explore the relationship between soil parameters and bacterial persistence. For soil genome detection, we adhered to a previously optimized and validated real-time PCR protocol targeting the tul4 gene 23, incorporating necessary controls. Soil samples were analyzed using optimized protocols for pH, moisture, texture, total soluble salts, and various elements. Detailed methodologies for the analyses can be found in the cited references [24][25][26][27][28][29][30][31]. These physicochemical soil features have different ranges of values, as displayed in Table 1. The use of proper personal protective equipment (PPE) was ensured during experimentation to maintain biosafety standards. A concise overview of soil sampling, genome extraction, detection, Ft distribution, and soil chemistry analysis is available in our prior research 32.

Attribute selection
Data filtering is important for constructing an accurate and efficient model that can enhance performance. Attribute selection models assist us in selecting the optimal set of features for analysis. If 21 input attributes are selected from the soil attribute dataset, the attribute matrix, represented by $X_{em} = [X_{1m}, X_{2m}, X_{3m}, \ldots, X_{Em}]$, consists of E column vectors, and $x_{em}$ is a specific feature value (with $e = 1, 2, 3, \ldots, E$ and $m = 1, 2, 3, \ldots, M$, where $E = 21$ and $M = 148$ in the dataset).

Attribute selection models
A feature selection algorithm incorporates a search procedure to recommend new feature subsets, together with evaluation criteria that assign different scores to various features 33. The ideal model is one that tries every possible subset of features and finds the subset that most decreases the error rate. Yet, this exhaustive search becomes unviable in larger feature spaces. The choice of evaluation metric greatly influences the procedure. Various feature-selection models have been employed here: the Support Vector Machine (SVM) attribute evaluator, ReliefF (RLF), Chi-Square (Chi-Sq), and Gini-Index (GI).
The feature ranking models are described below.

SVM attribute evaluator
This attribute evaluator assesses a feature's worth by using an SVM classifier. The features are ranked by the square of the weights assigned by the SVM. Feature ranking for multiclass scenarios is managed by ranking the features for each class separately, employing a one-vs-all approach, and then dealing from the top of each pile to produce a final ranking.

ReliefF
The main idea of this ranking model is to assess the quality of attributes by their ability to differentiate among samples of different classes in a local neighborhood. The most relevant attributes are therefore those that contribute more to increasing the distance between samples of different classes while contributing less to increasing the distance between samples of the same class 34. The weight update for RLF is:

$$W_z = W_z - \frac{\mathrm{diff}(Z, E, C_h)}{n} + \frac{\mathrm{diff}(Z, E, C_m)}{n}$$

where $W_z$ represents the weight for attribute Z, E is a randomly sampled instance, $C_h$ and $C_m$ represent the closest hit and closest miss, respectively, and n is the number of randomly sampled instances. The diff() function calculates the difference between two instances for a given attribute. For nominal attributes, it is defined as 0 if the values are the same and 1 if the values are different. For continuous features, the actual difference is normalized to the interval [0, 1]. Dividing by n ensures the weights are within the interval [-1, 1]. RLF is sensitive to attribute interactions and aims to estimate the change in probability for the weight of feature Z as defined in equation (2).
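The weight update described above can be sketched in a few lines of Python. This is a minimal illustration of the basic two-class Relief update with one nearest hit and one nearest miss, not the authors' implementation; the function names (`diff`, `relief_weights`), the Manhattan distance used to find neighbors, and the toy data are assumptions introduced here.

```python
import random

def diff(a, b, lo, hi):
    """Difference of one continuous attribute, normalized to [0, 1]."""
    return abs(a - b) / (hi - lo) if hi > lo else 0.0

def relief_weights(X, y, n_samples, seed=0):
    """Basic Relief: penalize attributes that differ on the closest hit,
    reward attributes that differ on the closest miss."""
    rng = random.Random(seed)
    n_attr = len(X[0])
    lo = [min(row[z] for row in X) for z in range(n_attr)]
    hi = [max(row[z] for row in X) for z in range(n_attr)]

    def distance(i, j):
        # Assumption: neighbors are found with a normalized Manhattan distance.
        return sum(diff(X[i][z], X[j][z], lo[z], hi[z]) for z in range(n_attr))

    weights = [0.0] * n_attr
    for _ in range(n_samples):
        e = rng.randrange(len(X))  # randomly sampled instance E
        hits = [j for j in range(len(X)) if j != e and y[j] == y[e]]
        misses = [j for j in range(len(X)) if y[j] != y[e]]
        c_h = min(hits, key=lambda j: distance(e, j))    # closest hit
        c_m = min(misses, key=lambda j: distance(e, j))  # closest miss
        for z in range(n_attr):
            weights[z] -= diff(X[e][z], X[c_h][z], lo[z], hi[z]) / n_samples
            weights[z] += diff(X[e][z], X[c_m][z], lo[z], hi[z]) / n_samples
    return weights

# Hypothetical toy data: attribute 0 separates the classes, attribute 1 is noise.
X = [[0.10, 0.5], [0.20, 0.1], [0.15, 0.9], [0.80, 0.4], [0.90, 0.8], [0.85, 0.2]]
y = ["A", "A", "A", "B", "B", "B"]
w = relief_weights(X, y, n_samples=20)
```

On this toy data the class-separating attribute receives a positive weight and the noisy one a negative weight, which mirrors how negative ReliefF weights mark uninformative soil features.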
In the Chi-Square (Chi-Sq) score, e represents the degrees of freedom, OF (observed frequency) is the number of observed instances of a class, and EF (expected frequency) is the number of instances of the class expected if there is no association between the target and the attribute.

Gini-index
Gini-Index (GI), also called Gini impurity, estimates the probability of a particular element being misclassified when picked randomly. A set is called pure if all its elements are associated with a single class. GI ranges between zero and one, where zero represents a pure classification, i.e., all the elements belong to a specific class or only one class exists. Moreover, 1 represents a random distribution of elements across different classes, while 0.5 represents an equal distribution of elements over distinct classes. The GI is calculated by subtracting the sum of the squared class probabilities from 1:

$$GI = 1 - \sum_{a=1}^{n} (P_a)^2$$

where $P_a$ is the probability of an element being classified into a distinct class a, and n is the number of classes.
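As a small worked example of the definition above, the following Python sketch computes the Gini impurity of a label set and the weighted impurity of a threshold split on one feature. The helper names (`gini_index`, `split_gini`) and the toy values are hypothetical; how the authors' tool derives per-feature Gini weights is not specified in the text.

```python
from collections import Counter

def gini_index(labels):
    """GI = 1 - sum of squared class probabilities."""
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def split_gini(values, labels, threshold):
    """Size-weighted Gini impurity after splitting one feature at a threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    return sum(len(part) / n * gini_index(part) for part in (left, right) if part)

pure = gini_index(["A"] * 10)                    # a single class
balanced = gini_index(["A"] * 74 + ["B"] * 74)   # two equally sized classes
clean_split = split_gini([1, 2, 3, 4], ["A", "A", "B", "B"], threshold=2)
```

A single-class set gives 0, and two equally sized classes (as with 74 positive and 74 negative samples) give 0.5. A feature whose split yields a lower weighted impurity separates the classes better.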

Hyperparameter optimization
Model optimization is one of the toughest challenges in implementing machine learning solutions. Finding appropriate hyperparameters is crucial for models. However, setting these hyperparameters to achieve good results takes time and effort. This problem is generally known as global optimization. The objective function can be deterministic or stochastic; a stochastic function can return different results when evaluated at the same point. A key element of Bayesian optimization (BO) is the acquisition function, which the technique employs to choose the successive points to assess. The acquisition function balances sampling at positions where the modeled objective function is low against exploring areas that have not yet been modeled well. The optimization internally maintains a Gaussian process (GP) model of the objective function, trained on the objective function evaluations. The GP prior is given as:

$$p(f) = \mathcal{GP}(f; \mu, K)$$

Given observations $D = (Y, f)$, we can condition our distribution on D as usual:

$$p(f \mid D) = \mathcal{GP}(f; \mu_{f \mid D}, K_{f \mid D})$$

How do we pick where to observe the function next for a given set of observations? A strategy in BO is to devise an acquisition function a(y). It is an inexpensive function evaluated at a particular point, reflecting the anticipated benefit of evaluating f at y for the minimization problem. The acquisition function is optimized to determine the location of the next observation. In essence, we have substituted the original optimization problem with another optimization problem, but one that operates on a much cheaper function a(y).
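To make the role of the acquisition function concrete, here is a minimal Python sketch of one common choice, expected improvement for a minimization problem. It assumes the GP posterior at a candidate point is already summarized by a mean and standard deviation; the text does not state which acquisition function the authors used, and the candidate values below are hypothetical.

```python
import math

def expected_improvement(mu, sigma, best):
    """Expected improvement over `best` (lowest observed objective value)
    at a point whose GP posterior is N(mu, sigma^2)."""
    if sigma <= 0.0:
        return max(best - mu, 0.0)
    z = (best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (best - mu) * cdf + sigma * pdf

# Hypothetical candidates y -> (posterior mean, posterior std) of the
# classification error; 0.15 is the lowest error observed so far.
candidates = {"y1": (0.20, 0.01), "y2": (0.16, 0.02), "y3": (0.18, 0.08)}
best_so_far = 0.15
scores = {name: expected_improvement(mu, sigma, best_so_far)
          for name, (mu, sigma) in candidates.items()}
next_point = max(scores, key=scores.get)  # where to evaluate f next
```

In this example the point with the largest uncertainty (`y3`) is selected even though its mean is not the lowest, which illustrates the balance between sampling low modeled values and exploring poorly modeled regions.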

Machine learning classifiers
In this section, we outline the various machine learning classifiers utilized in our study, including Support Vector Machine (SVM), Ensemble model (EM), and Neural Networks (NN) for training the proposed model.

SVM
SVM performs multi-class classification tasks by drawing a hyperplane to maximize the margin among classes.
The classifier also tries to minimize the error 35, and it offers advantages such as good generalization to new instances, the absence of local minima, and a representation that relies on a few features 36. Consider a training set of input vectors $x_i \in \mathbb{R}^d$, $i = 1, \ldots, N_t$, in a d-dimensional input space, with outputs $y_i \in \{1, -1\}$.
Equation 9 gives the SVM's hyperplane, in which x is the input vector and w is the constant weight vector of the SVM hyperplane. The training input vectors $x_i$ hold the attributes, and sign() is the signum function with ±1 output. The goal is to minimize Equation 10,
in which $C_b$ represents the box constraint and $\zeta_i$ penalizes the objective function for samples that cross the margin of their class.
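The quantities named above can be illustrated with a short Python sketch. It assumes the standard linear soft-margin form, a decision rule sign(w · x + b) and an objective of 0.5·||w||² plus the box constraint times the summed slacks, with hinge slack max(0, 1 − y_i(w · x_i + b)). The bias term `b`, the function names, and the toy data are assumptions introduced here, since Equations 9 and 10 are not reproduced in this text.

```python
def decision(w, b, x):
    """Signum of the hyperplane score w . x + b, returned as +1 or -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

def soft_margin_objective(w, b, X, y, box_constraint):
    """0.5 * ||w||^2 + C_b * sum(zeta_i), with hinge slack
    zeta_i = max(0, 1 - y_i * (w . x_i + b))."""
    slacks = [
        max(0.0, 1.0 - yi * (sum(wi * xi for wi, xi in zip(w, x)) + b))
        for x, yi in zip(X, y)
    ]
    return 0.5 * sum(wi * wi for wi in w) + box_constraint * sum(slacks)

# Hypothetical toy data: the last sample lies on the wrong side of the hyperplane.
X = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0], [0.5, 0.0]]
y = [1, 1, -1, -1]
w, b = [1.0, 1.0], 0.0
objective = soft_margin_objective(w, b, X, y, box_constraint=10.0)
```

Only the last sample violates its margin (slack 1.5), so the objective is 0.5·2 + 10·1.5 = 16. A larger box constraint makes such violations more expensive, which is why it trades boundary smoothness against training accuracy.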

Ensemble model
An ensemble is a predictive method that comprises a weighted combination of numerous classification models.
In general, combining several classification models improves performance over that of any single model.

Neural networks
A NN comprises a feed-forward and backpropagation network, which includes three types of layers: an input layer, an output layer, and a hidden layer. Each layer in the network has a specific role to play. The input layer accepts the input data, while the output layer carries out key functions such as prediction and classification. The hidden layers are the true workhorse of the model, executing the majority of the computation between the input and output layers. The backpropagation technique optimizes the weights of these layers. These models are used for classification, recognition, approximation, and prediction tasks and are effective for solving non-linearly separable problems. The computations taking place at each neuron in the hidden and output layers are as follows. Let $W^{(1)}, W^{(2)}$ represent the weights and $B^{(1)}, B^{(2)}$ the biases of the previous and next layer. The output of the previous layer, z, is multiplied by the weights of the current layer, $W^{(1)}$, to form the inner product. Then, a bias vector $B^{(1)}$ is added, and the result is fed into the activation function $r_1(\cdot)$, giving $r_1(W^{(1)} z + B^{(1)})$; the next layer applies $W^{(2)}$, $B^{(2)}$, and $r_2(\cdot)$ in the same way. The activation functions $r_1, r_2$ introduce non-linearity into the model. The most commonly applied activation functions are sigmoid and tanh, where sigmoid is $\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}$.

Experiments

Data description
All the feature-ranking and hyperparameter optimization experiments on machine learning models are performed on the F. tularensis soil attribute dataset, comprising 148 samples. Each sample consists of 21 soil features. A supervised dataset is required to prepare a predictive model for classification, so we assigned the label "A" to positive samples and "B" to negative soil samples in the dataset.

Software tool and performance measures
We use MATLAB for experimentation on the Ft soil attribute dataset, both for feature ranking and for hyperparameter optimization of the classification models. Initially, we load the dataset into the workspace, and then a 10-fold cross-validation scheme is applied to measure each model's accuracy. Once the app has loaded the data, we can choose from several feature selection algorithms available in MATLAB for feature ranking. Next, we choose models that can be optimized and calculate their accuracy by picking the top-ranked features from the dataset sequentially using the nested subset method. These models adjust their parameters automatically by testing various hyperparameter combinations through an optimization process. The objective of this process is to minimize classification error or cost. The accuracy of the model can be viewed in the history panel, and its classification errors can be seen by clicking on the confusion matrix icon in the plot section.
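The workflow above was run in MATLAB; the following Python sketch only illustrates its two structural ideas, a 10-fold partition of the 148 samples and the nested subset loop over ranked features. The function names and the `evaluate` callback (standing in for "train an optimized classifier and return its cross-validated accuracy") are hypothetical, and the toy scoring function is not real data.

```python
def k_fold_indices(n_samples, k=10):
    """Split range(n_samples) into k contiguous folds of near-equal size.
    In practice the samples would be shuffled or stratified first."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def nested_subsets(ranked_features):
    """Yield the top-1, top-2, ..., top-E prefixes of a ranked feature list."""
    for k in range(1, len(ranked_features) + 1):
        yield ranked_features[:k]

def best_subset(ranked_features, evaluate):
    """Return the (subset, score) pair with the highest score."""
    scored = ((subset, evaluate(subset)) for subset in nested_subsets(ranked_features))
    return max(scored, key=lambda pair: pair[1])

folds = k_fold_indices(148, k=10)

# Hypothetical accuracy curve that peaks at three features.
ranked = ["Cy", "N", "SS", "Si", "OM"]
subset, score = best_subset(ranked, evaluate=lambda s: 1.0 - abs(len(s) - 3) / 10)
```

With 148 samples, eight folds hold 15 samples and two hold 14. Each fold serves once as the validation set while the remaining folds are used for training.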

Hyper-parameters for classifiers
In this section, we outline the key hyperparameters employed for the classifiers, including Support Vector Machines (SVM), Ensemble models, and Neural Networks, during the experimental phase.

SVM implementation
The MATLAB implementation of the SVM model underwent comprehensive parameter optimization to enhance overall performance. The key parameters considered included the box-constraint level, kernel scale, data standardization, multiclass function, and kernel type. The box-constraint level, influencing the balance between smooth decision boundaries and accurate classification of training points, was tuned to 780 in the optimized SVM model. A Gaussian kernel was chosen to shape the decision boundary, with the kernel scale set to 16.3794 for optimal performance. Various kernel functions, including Gaussian, Linear, Quadratic, and Cubic, were explored. Data standardization was implemented to ensure consistency in input feature scaling. The multiclass function, offering the choice between One-vs-All and One-vs-One, was set to a one-vs-one configuration for multi-class scenarios. These optimizations aimed to strike a balance between decision boundary smoothness and accurate classification, with the chosen configurations contributing to the robustness of the SVM model.

Ensemble model implementation
The implementation of the Ensemble model in MATLAB underwent a thorough optimization process for key parameters, each playing a crucial role in shaping the model's overall performance. The number of learners, pivotal for balancing complexity and computational efficiency, was optimized within the range of 10-500 and ultimately set to 22. The maximum number of splits, ranging from 1 to 147, was tuned to 4, enhancing the model's capacity to capture intricate dataset relationships. Similarly, the number of predictors to sample was optimized within the range of 1-14, with the final value set to 14, striking a balance between diversity and efficiency during the learning process. The learning rate, critical for optimization convergence, was tuned within the range of 0.001-1, with the optimized value set to 0.95019. Various ensemble types, including AdaBoost, RUSBoost, LogitBoost, GentleBoost, and Bag, were explored, with AdaBoost yielding the most effective results. This parameter configuration supports the robustness and predictive capability of the Ensemble model.
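For comparison with Bayesian optimization, random search (RS) over the ranges quoted above can be sketched as follows. This is an illustrative Python sketch, not the MATLAB optimizer the authors used; the log-uniform sampling of the learning rate, the iteration count, and `toy_objective` (a stand-in for cross-validated classification error) are assumptions.

```python
import math
import random

SEARCH_SPACE = {
    "num_learners": (10, 500),      # integer range quoted in the text
    "max_splits": (1, 147),         # integer range quoted in the text
    "learning_rate": (0.001, 1.0),  # sampled log-uniformly (assumption)
}

def sample_config(rng):
    """Draw one hyperparameter configuration at random."""
    lo, hi = SEARCH_SPACE["learning_rate"]
    return {
        "num_learners": rng.randint(*SEARCH_SPACE["num_learners"]),
        "max_splits": rng.randint(*SEARCH_SPACE["max_splits"]),
        "learning_rate": math.exp(rng.uniform(math.log(lo), math.log(hi))),
    }

def random_search(objective, n_iter=30, seed=0):
    """Keep the sampled configuration with the lowest objective value."""
    rng = random.Random(seed)
    best_config, best_error = None, float("inf")
    for _ in range(n_iter):
        config = sample_config(rng)
        error = objective(config)
        if error < best_error:
            best_config, best_error = config, error
    return best_config, best_error

def toy_objective(config):
    """Hypothetical stand-in for the cross-validated classification error."""
    return (abs(math.log10(config["learning_rate"])) * 0.05
            + abs(config["max_splits"] - 4) / 1000)

best_config, best_error = random_search(toy_objective)
```

Unlike BO, each draw here ignores the results of earlier evaluations, so RS needs no surrogate model but may spend evaluations in unpromising regions.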

Neural network implementation
The implementation of the neural network in MATLAB involved the optimization of several key hyperparameters, each exerting a significant impact on the overall performance of the model. The number of fully connected layers, ranging from 1 to 3, was explored, with the optimal configuration determined as two layers. The size of each layer, including the first, second, and third layers, varied between 1 and 300. For optimal results, the number of neurons in the first layer is set to one, and in the second layer to two. The regularization strength (Lambda) was searched over a range from 6.7568e−08 to 675.6757, and the optimized value was set to 0.01174. Data standardization, configurable as either true or false, was implemented to ensure consistency in the scale of input features, contributing to the robustness of the neural network model. Activation functions, including ReLU, Tanh, and Sigmoid, were explored, with the Tanh function identified as the most effective. These configurations collectively aimed to achieve optimal performance and reliability in the neural network model.
Similarly, as shown in Table 2, when examining the 11 attributes contributing the least, five of them (Potassium (K), Calcium (Ca), Chromium (Cr), Copper (Cu), and pH) persist across all feature-ranking models.
Fig. 2 illustrates the outcomes of three distinct feature ranking algorithms: Chi-Square, ReliefF, and Gini-Index. In the Chi-Square algorithm, Clay emerges as the most influential feature with a substantial weight of 16.81. Silt and Nitrogen follow with weights of 8.30 and 8.16, emphasizing their significant contributions to the classification. Conversely, Copper and Potassium are identified as the least significant features, receiving minimal weights of 0.18 and 0.20, respectively. The ReliefF algorithm corroborates the significance of Clay, ranking it as the most important soil feature with a weight of 0.217. Following Clay, Soluble Salts and Phosphorus exhibit weights of 0.161 and 0.106, respectively. Notably, Potassium and pH emerge as the least significant features with weights of −0.090 and −0.073. Similarly, the Gini-Index algorithm underscores Clay as the most crucial feature, assigned a weight of 0.35798. Nitrogen and Organic Matter follow closely with weights of 0.41617 and 0.42391, respectively. On the other hand, Potassium and Copper are identified as the least significant features, with weights of 0.48966 and 0.48734, respectively. Because a lower Gini impurity indicates a purer separation of classes, lower Gini-Index values correspond to more important features. These weights offer a quantitative measure of each feature's impact, facilitating the identification of key contributors and less influential variables in the context of pathogen prevalence in soil.
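One simple way to combine rankings from several methods into a single score is to convert each method's weights to ranks and sum them. The Python sketch below does this for the four features whose Chi-Square and Gini-Index weights are both quoted above. The sum-of-ranks rule and the function names are assumptions for illustration; the exact scoring used in the two-stage ranking is not reproduced in this text.

```python
def ranks_from_weights(weights, higher_is_better=True):
    """Map {feature: weight} to {feature: rank}, where rank 1 is most important."""
    ordered = sorted(weights, key=weights.get, reverse=higher_is_better)
    return {feature: position for position, feature in enumerate(ordered, start=1)}

def aggregate_ranks(rank_tables):
    """Sum each feature's rank over all methods; a lower total is better."""
    totals = {}
    for table in rank_tables:
        for feature, rank in table.items():
            totals[feature] = totals.get(feature, 0) + rank
    order = sorted(totals, key=lambda f: (totals[f], f))  # ties broken by name
    return order, totals

# Weights quoted in the text for Chi-Square and Gini-Index.
chi_square = {"Clay": 16.81, "Nitrogen": 8.16, "Copper": 0.18, "Potassium": 0.20}
gini = {"Clay": 0.35798, "Nitrogen": 0.41617, "Copper": 0.48734, "Potassium": 0.48966}

order, totals = aggregate_ranks([
    ranks_from_weights(chi_square, higher_is_better=True),
    ranks_from_weights(gini, higher_is_better=False),  # lower impurity is better
])
```

For these four features, Clay and Nitrogen lead under both methods, while Copper and Potassium swap the last two places and therefore tie on the summed rank.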
Next, we perform a two-stage attribute ranking to assess each feature's impact on the prevalence of Ft in soil-related environments. Initially, various feature-ranking approaches are employed to rank soil features, followed by ...
Finally, we present our proposed SVM classifier, which was optimized using the Bayesian optimization technique to generate an F1 score of 86.5% and an accuracy of 86.5%. The details of the training results, model details, optimized hyperparameters, and optimizer options are shown in Table 6. Fig. 5 depicts the confusion matrix, assessing the performance of the optimized SVM classifier in distinguishing between Class A (Positive) and Class B (Negative). The matrix involves a total of 148 instances, evenly distributed between the positive and negative classes, each comprising 74 instances. Among the 74 positive instances, 64 are correctly classified (True Positives, TP) as Class A, while 10 instances are misclassified (False Negatives, FN) as Class B. Similarly, out of the 74 negative samples, 64 instances are correctly classified (True Negatives, TN) as Class B, with 10 instances being misclassified (False Positives, FP) as Class A. A good classifier has a dominantly diagonal confusion matrix, since most of the predicted labels match the actual labels, with only a few off-diagonal entries that indicate confusion between classes, as is visible in the case of our optimized SVM model.
The error plot in Fig. 6 provides a visual representation of the classification error analysis for the SVM model. In the plot, the estimated minimum classification error is depicted by light blue circular points, while the observed minimum classification error is represented by dark blue points. The orange box highlights the hyperparameters associated with the best-performing point, indicating the configuration that yielded optimal results during the training process. Additionally, the yellow circle signifies the hyperparameters corresponding to the minimum observed error, pinpointing the configuration where the SVM model achieved its highest accuracy. This graphical representation aids in identifying the effectiveness of different hyperparameter settings, allowing for a nuanced understanding of the model's performance and guiding the selection of optimal configurations for future experiments.
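The accuracy and F1 score reported for the confusion matrix in Fig. 5 can be recomputed from its four counts. The short Python sketch below uses the standard definitions of the metrics; the function name `binary_metrics` is introduced here for illustration.

```python
def binary_metrics(tp, fn, fp, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Counts reported for the optimized SVM model (Class A taken as positive).
metrics = binary_metrics(tp=64, fn=10, fp=10, tn=64)
percentages = {name: round(value * 100, 1) for name, value in metrics.items()}
```

Because the matrix is symmetric (10 errors in each class), precision, recall, F1, and accuracy all equal 128/148, which rounds to the reported 86.5%.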
Figs. 7 and 8 show how the classification performance of the algorithms changes as the number of attributes is altered under different hyperparameter optimization techniques. Figure 7 displays the performance of classifiers using the RLF and BO strategies. For the initial set of features, NN generates more promising results than the other classification models. However, the models show similar results for mid-level features, and SVM surpasses the other models for the last few attributes. Overall, SVM yields the best results with an accuracy of 86.5%, so its overall performance is better than that of the other machine learning classifiers. Fig. 8 shows the accuracy of machine learning models for RLF using the RS technique. For the initial set of features, all the classifiers generate similar results. However, SVM surpasses all the other classification models for mid- and final-level features, reaching a classification accuracy of 86.5%.
In summary, the results propose that:
1. While assessing the top 10 features, the 5 most contributing features common among all are {Cy, N, SS, Si, Zn}.
2. The 5 least significant features for Ft are {K, Ca, Cr, Cu, pH}.
3. Hyperparameter optimization using BO produces better outcomes than other optimization techniques.
4. SVM is the best performer among classification models.
5. SVM achieves the best classification accuracy of 86.5% for the first 15 soil features {Cy, N, SS, Si, OM, Zn, Pb, Mn, Mg, Ni, Ms, Cd, Si, pH, Cr} using BO and RS.
6. For multi-dimensional data, optimizing the parameters of machine learning models using hyperparameter optimization techniques can significantly improve performance. Therefore, the selection of correct hyperparameters is essential for yielding good classification results.

Comparative analysis with prior machine learning techniques
A few recent works have applied machine learning to classify various soil-borne pathogenic bacteria, such as F. tularensis and C. burnetii, and the conditions that support their sustenance in soil, as exhibited in Table 7. In contrast to previous research, our design uses hyperparameter tuning with two-stage attribute ranking on a new F. tularensis dataset.

Discussions
Machine learning models are applied as a benchmark in various fields, such as disease diagnosis [38][39][40][41], bio-informatics 42, medical science 43, agriculture 44, and soil classification 45. Our work reveals that these models, rather than current statistical techniques, demonstrate outstanding results for the classification of F. tularensis and for learning its behavior in soil settings.
The results highlight the significance of specific soil characteristics for the survival of F. tularensis, as illustrated in Table 3. Previous analyses have consistently pointed to abiotic factors, such as organic matter, clay, and various micro-nutrients, as primary drivers of bacterial communities in soil [46][47][48][49]. Moreover, these factors positively correlate with the prevalence of soil-borne pathogenic bacteria [50][51][52]. Clay and silt, known for their increased surface area, are suggested to contain a significant amount of organic matter, potentially fostering the existence of bacteria 53. Recent studies 16,17,37,54 also emphasize the importance of soil's physical and chemical properties, including clay, nitrogen, soluble salts, silt, organic matter, zinc, lead, and nickel, for the persistence of F. tularensis, C. burnetii, and B. anthracis.
Our investigation underscores clay as the most influential attribute for the presence of F. tularensis in soil, aligning with previous works 16,32,52. Subsequent crucial attributes contributing to the sustenance of the bacterial pathogen include nitrogen, soluble salts, silt, organic matter, and zinc. Organic matter is established as beneficial for bacterial survival in soil settings 16,51,52, while nitrogen is crucial for the persistence of pathogens within their hosts 55. Zinc, soluble salts, organic matter, and nitrogen are identified as related to the survival of F. tularensis in the soil 16,32,56. Zinc, in particular, plays a role in various cellular operations, including metabolism, gene expression, pH regulation, glycolysis, DNA replication, and amino acid synthesis 57, with excess zinc potentially inducing toxicity 58. Recent works 32,54 suggest a positive association between soluble salts and the prevalence of F. tularensis and C. burnetii. Additionally, studies 56,59 indicate that organic matter and nitrogen are associated with the prevalence of A. brasilense and C. burnetii.
The remaining contributing features from Table 3 include lead, manganese, magnesium, and nickel. Our results align with studies 16,22,32 that establish positive correlations between attributes such as manganese, magnesium, lead, and nickel and F. tularensis in soil. Organic matter, manganese, and magnesium are associated with B. anthracis, and magnesium is linked to the prevalence of C. burnetii in soil 17. Magnesium also contributes to bacterial survival during starvation and cold shocks 60.
Our study also reveals that cadmium, moisture, sand, and pH play intermediary roles. Earlier works [47][48][49] stress the importance of pH, soil texture, and soil nutrients for microbial communities. Recent analysis 22 supports a positive association between F. tularensis and cadmium, pH, and moisture in soil environments. Another work 61 suggests F. tularensis is associated with low temperature and moisture, emphasizing the pathogen's affinity for these conditions. Univariate analysis 54 shows significant differences among C. burnetii positive and negative soils for pH, nitrogen, magnesium, soluble salts, and organic matter.
Our results indicate that the least contributing soil attributes, as shown in Table 4, include potassium, calcium, copper, sodium, iron, phosphorus, and chromium. This aligns with recent findings 22 displaying no substantial differences between F. tularensis negative and positive sites concerning copper, sand, iron, calcium, phosphorus, chromium, and sodium in the soil. Conversely, B. anthracis and C. burnetii exhibit positive affinities to copper, chromium, cobalt, cadmium, sodium, iron, calcium, and potassium 17. Additionally, research 19 suggests sodium and potassium facilitate F. tularensis growth in water and soil. Recent research 54 shows no substantial differences among Coxiella positive and negative sites related to copper, chromium, iron, and phosphorus in the soil. Analysis 16 and similar work 32 indicate that soil features like copper, chromium, phosphorus, iron, sodium, potassium, and calcium do not exhibit any affiliation with F. tularensis. Nonetheless, other studies 62 acknowledge that the aerobic heterotrophic community is sensitive to various nutrients, including zinc, cadmium, chromium, mercury, manganese, nickel, and copper.
Comparing our current findings with our previous publication on F. tularensis using machine learning, we observe a slight variation in the sequence of the most significant factors. In the current work, the order of significance is clay, nitrogen, soluble salts, silt, organic matter, and zinc. However, in our previous work, the sequence was clay, nitrogen, organic matter, soluble salts, zinc, and silt. Similarly, when examining the sequence of least significant factors in the current research, we find potassium, calcium, copper, sodium, iron, and phosphorus to have the least impact. In contrast, our earlier work identified potassium, phosphorus, iron, calcium, copper, chromium, and sand as the least influential. The observed shift in sequence can be attributed to the adoption of a more effective ranking methodology in which features are evaluated based on the cumulative weighted score of all methods. This refined approach allowed us to discern a more nuanced order of significance among the key factors influencing the survival of F. tularensis in soil. Furthermore, hyperparameter optimization played a pivotal role in enhancing accuracy, leading to an improvement of over 2% compared to our previous work. The fine-tuning of hyperparameters contributed to a more robust and accurate machine learning model, thereby reinforcing the reliability of our current findings.

Conclusion and future works
In summary, our study delves into the outcomes of various attribute-ranking methods, comparing their rankings across different classifiers optimized with hyperparameter optimization techniques using Ft positive and negative soil datasets. Beyond the specific case study, our findings underscore the significance of key soil features, with clay emerging as the top-ranked attribute, followed by nitrogen, soluble salts, silt, organic matter, and zinc. Bayesian optimization (BO) gives the best results among the hyperparameter optimization techniques, contributing to the robustness of our models. Specifically, Support Vector Machine (SVM) stands out as the most effective classifier, achieving an accuracy of 86.5% when considering the first 15 soil features {Cy, N, SS, Si, OM, Zn, Pb, Mn, Mg, Ni, Ms, Cd, Si, pH, Cr} with BO and random search (RS). Expanding beyond SVM, our study explores alternative models such as {BO+NN} and {RS+NN}, showcasing noteworthy classification accuracies of 83.8% and 83.1%, respectively. These models, utilizing 16 and 15 soil attributes, offer valuable insights into understanding the contribution of specific soil features to the prevalence of bacterial pathogens in soil-related environments.
While our investigation provides crucial insights into the interplay between soil characteristics and pathogen prevalence, it is essential to acknowledge that the size of our dataset is limited. In subsequent studies, we aim to enhance the robustness of our findings by expanding the geographical scope of our dataset. Specifically, we plan to explore additional districts within Punjab and extend our investigation to encompass other provinces in the country. By doing so, we aspire to gather a more extensive dataset that encapsulates the diversity of soil characteristics across different regions. This geographical expansion will not only contribute to a more comprehensive understanding of the interplay between soil attributes and pathogen prevalence but also facilitate the development of machine learning models that are more adaptable and representative of diverse environmental conditions.

Figure 1. Different stages of Francisella tularensis feature-ranking, classification and optimization.

Figure 4. Least-ranked features for Francisella in soil.

Figure 5. Confusion matrix for proposed SVM model for Ft classification.

Figure 6. Bayesian optimization error plot for proposed model.

Figure 7. Performance of different classifiers using Bayesian Optimization.

Table 1. Range of different physicochemical soil characteristics.

Chi-Square (Chi-Sq) is employed for categorical attributes in a dataset. We calculate Chi-Sq between each feature and the target class and pick the desired number of attributes with the best Chi-Sq scores. A high score reveals that the corresponding feature is essential. The technique decides whether the relationship between two categorical variables in the sample reflects their real association in the population. The Chi-Sq score is shown as follows:

$$\chi^2_e = \sum_{i=1}^{n} \frac{(OF_i - EF_i)^2}{EF_i}$$

Scientific Reports | (2024) 14:1743 | https://doi.org/10.1038/s41598-024-51502-z

Table 2. Attribute-ranking for Ft in soil using various attribute selection methods.

Table 3. Index of best-ranked features for Francisella in soil.

Table 4. Index list of least-ranked features for Francisella in soil.

rk(SVM) rk(Chi-Square) rk(Gini-Index) Ranking score of each attribute
The attribute with the most impact for RLF is Cy. Using this attribute, SVM, EM, and NN achieve accuracies of 77%, 75%, and 75% with Bayesian optimization (BO), and 73.6%, 77%, and 73% with Random Search optimization (RS), respectively. The results in Table 5 reveal several key findings:
1. The two optimization techniques yield different results for various classification models.
2. For both optimization techniques, SVM achieves an accuracy of 86.5% for 15 soil features.
3. The performance of different classification models is inherently arbitrary: (a) (BO+SVM, 86.5%)

Table 5. A comparative analysis of different optimization techniques against different machine learning classifiers using the ReliefF attribute selection method.

4. The results suggest that the BO technique yields more favorable outcomes for classifiers like SVM, EM, and NN compared to RS.
5. SVM outperforms other classifiers for both BO and RS.
6. BO+SVM produces the best classification results for the 15 soil features: Cy, N, SS, Si, OM, Zn, Pb, Mn, Mg, Ni, Ms, Cd, Si, pH, Cr.
7. Other models, such as BO+NN and RS+NN, also generate noteworthy results of 83.8% and 83.1%, utilizing 16 and 15 soil features, respectively.

Table 6. Details of results, optimized hyperparameters, and optimizer for proposed SVM model.

Table 7. A comparative analysis with prior machine learning techniques.